HTML Version

Background

E-commerce has boomed in recent years, and many brick-and-mortar retail corporations (e.g. Walmart, Target) have entered the competition. The goal of this project is to use a dataset provided by Walmart to forecast weekly sales in 45 stores based on a set of features (e.g. Store Size, Unemployment Rate, Consumer Price Index (CPI)). The first part of the analysis reflects the past and current financial condition of the corporation, for either a single store or a particular department. It also explores how each feature impacts the business outcome, which is crucial information for a corporation's operations. In the second part, a predictive time-series model is implemented to forecast future sales on a weekly basis. This step is indispensable for a company that wants an estimate of the future direction of its profits, or even its stock price, and it helps the management team prepare business strategy for the next period.

Client / Audience

The primary audience for this analysis is Walmart's management team and executives; the findings can inform how they design their operating format and business plans. A second group of readers would be professionals who follow Walmart's business condition; the findings offer them insight into current and future sales in the brick-and-mortar stores. A third group would be anyone who wants to understand the sales patterns in Walmart stores or particular departments. The analysis also shows how seasonal factors change sales, which is valuable for both professionals and merchants.

Questions to be addressed

- Discover the sales distribution across stores and individual departments.

- Discover the correlation between the features and Weekly Sales.

- Discover how each feature contributes differently to the predictive models.

- Determine the best model(s) for forecasting weekly sales (continuous values).

- How do the models perform differently across states?

Dataset Exploration

Summary

  • There are 421,570 instances in this dataset, covering the sales results of 45 retail stores. Each store has about 99 departments.
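The counts above can be checked directly with pandas. A minimal sketch, using a hypothetical toy frame in place of the real merged training set (which has 421,570 rows, 45 stores, and roughly 99 departments per store):

```python
import pandas as pd

# Hypothetical stand-in for the merged training frame
train = pd.DataFrame({
    "Store": [1, 1, 2, 2, 2],
    "Dept":  [1, 2, 1, 2, 3],
    "Weekly_Sales": [100.0, 200.0, 150.0, 50.0, 75.0],
})

n_rows = len(train)                                   # total instances
n_stores = train["Store"].nunique()                   # distinct stores
depts_per_store = train.groupby("Store")["Dept"].nunique()  # departments per store
print(n_rows, n_stores, depts_per_store.max())
```

On the real data, the same three lines report 421570 rows, 45 stores, and the department count per store.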

Missing Values

  • Before getting into the analysis, we first deal with the missing values. As shown below, the missing values are concentrated in five features: MarkDown1, MarkDown2, MarkDown3, MarkDown4, and MarkDown5; no missing values are found in the other features. For this reason, we dig deeper into these features to check whether the missing values follow a pattern systematic enough to allow imputation.

  • The following shows the proportion of missing values for every store. Based on Table 1, the proportion of missing values in these five features ranges from roughly 63% to 96%. Given such high volumes, we dig further to check whether the values are missing at random. Table 2 shows that the date range without missing values runs from 2011-11-11 to 2012-10-26, and Table 3 shows that the range with missing values runs from 2010-02-05 to 2011-11-04. We can therefore conclude that this large volume of missing values is NOT missing at random. Considering both the volume and the missing pattern, I propose dropping the five features (i.e. columns) rather than the records (i.e. rows), because keeping as many records as possible is important for the time-series analysis.
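The per-store missing proportions that Table 1 prints with a manual counting loop can also be computed in one grouped aggregation. A minimal sketch with hypothetical values (two markdown columns instead of five):

```python
import numpy as np
import pandas as pd

# Toy frame with NaN-heavy markdown columns (hypothetical values);
# the real dataset's five MarkDown columns are ~63-96% missing.
train = pd.DataFrame({
    "Store": [1, 1, 1, 1, 2, 2, 2, 2],
    "MarkDown1": [np.nan, np.nan, np.nan, 5.0, np.nan, 1.0, 2.0, 3.0],
    "MarkDown2": [np.nan] * 8,
})

markdown_cols = ["MarkDown1", "MarkDown2"]
# isna().mean() gives the fraction of missing values per column directly
missing_by_store = train.groupby("Store")[markdown_cols].agg(
    lambda s: s.isna().mean()
)
print(missing_by_store)
```

Each cell of `missing_by_store` is the missing fraction for one (store, feature) pair, matching the percentages printed in Table 1.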

Selected Features

  • IsHoliday (binary)
  • Temperature (continuous)
  • Fuel_Price (continuous)
  • Consumer Price Index, CPI (continuous)
  • Store_Type (categorical: A, B, C)
  • Unemployment Rate (continuous)

Categorical Features

  • This dataset contains two categorical features: "IsHoliday" and "Store_Type". To proceed with the time-series analysis, these categorical attributes are dummy-encoded into indicator variables, as shown in Table 4: the binary "IsHoliday" feature becomes 2 columns (False, True) and the three-level "Store_Type" feature becomes 3 columns (A, B, C).

Outliers

  • Weekly_Sales: Due to geographical and holiday effects, the sales of each store are expected to vary. Referring to Figure 1, the weekly sales not only vary quite a bit, but a considerably large number of data points fall outside the typical range. Since this variation may be caused by a mix of factors, I leave those "outliers" as they are for now.

  • Temperature: Referring to Figure 2, the "Temperature" variable looks very consistent across the stores, and NO outliers fall significantly outside the typical range of any store.

  • Fuel_Price: Referring to Figure 3, the "Fuel_Price" feature is even more consistent than "Temperature", ranging between 2.5 and 4.5 across the stores, so I believe there are NO outliers in this feature. This makes sense because fuel has a market price across the country, so little variation is expected within each geographical area.

  • Consumer Price Index (CPI): According to the U.S. Bureau of Labor Statistics, the CPI is "a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services". Referring to Figure 4, the CPI values, ranging from 120 to 230, differ quite a bit among the stores, but there are NO outliers in this dataset. This figure gives a relative indication of consumer spending across stores. As a working hypothesis, areas with higher CPI values should see more consumer spending and thus generate more sales for the store, and vice versa. Further analysis will measure this aspect of the feature as the analysis proceeds.

  • Unemployment: The unemployment rate is another index of consumer spending power. Referring to Figure 5, most of the values fall between 6 and 8, without outliers among the stores. As with "CPI", further analysis will measure how this feature influences weekly sales across the stores.
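The visual outlier checks in Figures 1-5 rest on the usual box-plot fences: points beyond 1.5 times the interquartile range (IQR) from the quartiles. A minimal sketch of that rule, with hypothetical sales values:

```python
import pandas as pd

# Hypothetical weekly sales values, including one extreme point
sales = pd.Series([10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 95.0])

q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
# The 1.5*IQR fences that box plots draw by default
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = sales[(sales < lower) | (sales > upper)]
print(list(outliers))  # → [95.0]
```

Applying the same rule per store to "Temperature", "Fuel_Price", "CPI", and "Unemployment" confirms the visual impression that those features have no points beyond the fences.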

In [1]:
import pandas as pd
import numpy as np

trainDf = pd.read_csv("./data/train.csv")
feaDf = pd.read_csv("./data/features.csv").drop(["IsHoliday"], axis=1)
storesDf = pd.read_csv("./data/stores.csv")


df1 = pd.merge(trainDf, feaDf, on = ["Store", "Date"])
#print(df1.shape)
train = pd.merge(df1, storesDf, on = "Store")
print("Dataset size:",train.shape)

#Rename column names
train.columns = ['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday', 'Temperature',
       'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4',
       'MarkDown5', 'CPI', 'Unemployment', 'Store_Type', 'Size']

display(train.head(3))
Dataset size: (421570, 16)
Store Dept Date Weekly_Sales IsHoliday Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment Store_Type Size
0 1 1 2010-02-05 24924.50 False 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 A 151315
1 1 2 2010-02-05 50605.27 False 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 A 151315
2 1 3 2010-02-05 13740.12 False 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 A 151315
In [2]:
cntLst = []
print("Table 1")
print('''Proportion of Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"''')
print()
for store in train.Store.unique():
    for var in ["MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"]:
        cntLst.append(train[var][train.Store == store].isnull().sum())
    print("Store #%d"%(store), end="")
    print(" with %d instance has missing values"%(train[train.Store == store].shape[0]), end=" ")
    print(cntLst, end=" ")
    print("in percent", end=" ")
    print(np.round(np.array(cntLst) / train[train.Store == store].shape[0], 2)*100, end="\n")
    cntLst=[]
Table 1
Proportion of Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"

Store #1 with 10244 instance has missing values [6587, 7229, 6656, 6587, 6587] in percent [64. 71. 65. 64. 64.]
Store #2 with 10238 instance has missing values [6575, 7218, 6646, 6575, 6575] in percent [64. 71. 65. 64. 64.]
Store #3 with 9036 instance has missing values [5791, 6554, 6230, 5922, 5791] in percent [64. 73. 69. 66. 64.]
Store #4 with 10272 instance has missing values [6596, 7171, 6739, 6668, 6596] in percent [64. 70. 66. 65. 64.]
Store #5 with 8999 instance has missing values [5776, 6660, 6279, 5906, 5776] in percent [64. 74. 70. 66. 64.]
Store #6 with 10211 instance has missing values [6559, 7057, 6702, 6559, 6559] in percent [64. 69. 66. 64. 64.]
Store #7 with 9762 instance has missing values [6256, 7209, 6396, 6256, 6256] in percent [64. 74. 66. 64. 64.]
Store #8 with 9895 instance has missing values [6360, 6982, 6496, 6429, 6360] in percent [64. 71. 66. 65. 64.]
Store #9 with 8867 instance has missing values [5657, 6791, 6225, 5783, 5657] in percent [64. 77. 70. 65. 64.]
Store #10 with 10315 instance has missing values [6669, 7736, 6957, 6669, 6669] in percent [65. 75. 67. 65. 65.]
Store #11 with 10062 instance has missing values [6438, 7003, 6508, 6580, 6438] in percent [64. 70. 65. 65. 64.]
Store #12 with 9705 instance has missing values [6211, 6961, 6417, 6211, 6211] in percent [64. 72. 66. 64. 64.]
Store #13 with 10474 instance has missing values [6745, 7330, 6816, 6745, 6745] in percent [64. 70. 65. 64. 64.]
Store #14 with 10040 instance has missing values [6451, 7157, 6520, 6451, 6451] in percent [64. 71. 65. 64. 64.]
Store #15 with 9901 instance has missing values [6359, 7055, 6705, 6359, 6359] in percent [64. 71. 68. 64. 64.]
Store #16 with 9443 instance has missing values [5961, 6918, 6638, 5961, 5961] in percent [63. 73. 70. 63. 63.]
Store #17 with 9864 instance has missing values [6304, 7285, 6864, 6304, 6304] in percent [64. 74. 70. 64. 64.]
Store #18 with 9859 instance has missing values [6295, 6921, 6506, 6295, 6295] in percent [64. 70. 66. 64. 64.]
Store #19 with 10148 instance has missing values [6526, 7095, 6738, 6526, 6526] in percent [64. 70. 66. 64. 64.]
Store #20 with 10214 instance has missing values [6561, 6992, 6631, 6561, 6561] in percent [64. 68. 65. 64. 64.]
Store #21 with 9582 instance has missing values [6133, 6808, 6402, 6133, 6133] in percent [64. 71. 67. 64. 64.]
Store #22 with 9688 instance has missing values [6212, 7037, 6552, 6212, 6212] in percent [64. 73. 68. 64. 64.]
Store #23 with 10050 instance has missing values [6464, 7169, 6743, 6464, 6464] in percent [64. 71. 67. 64. 64.]
Store #24 with 10228 instance has missing values [6570, 7072, 6641, 6570, 6570] in percent [64. 69. 65. 64. 64.]
Store #25 with 9804 instance has missing values [6296, 6915, 6569, 6296, 6296] in percent [64. 71. 67. 64. 64.]
Store #26 with 9854 instance has missing values [6310, 7002, 6446, 6382, 6310] in percent [64. 71. 65. 65. 64.]
Store #27 with 10225 instance has missing values [6569, 7143, 6640, 6569, 6569] in percent [64. 70. 65. 64. 64.]
Store #28 with 10113 instance has missing values [6494, 7058, 6704, 6494, 6494] in percent [64. 70. 66. 64. 64.]
Store #29 with 9455 instance has missing values [6088, 6946, 6352, 6088, 6088] in percent [64. 73. 67. 64. 64.]
Store #30 with 7156 instance has missing values [4693, 6338, 5287, 6241, 4591] in percent [66. 89. 74. 87. 64.]
Store #31 with 10142 instance has missing values [6528, 7233, 6598, 6528, 6528] in percent [64. 71. 65. 64. 64.]
Store #32 with 10202 instance has missing values [6573, 7144, 6645, 6573, 6573] in percent [64. 70. 65. 64. 64.]
Store #33 with 6487 instance has missing values [4194, 5552, 5547, 6156, 4099] in percent [65. 86. 86. 95. 63.]
Store #34 with 10224 instance has missing values [6616, 7246, 6828, 6616, 6616] in percent [65. 71. 67. 65. 65.]
Store #35 with 9528 instance has missing values [6112, 7582, 6315, 6112, 6112] in percent [64. 80. 66. 64. 64.]
Store #36 with 6222 instance has missing values [4058, 5382, 5380, 5739, 3968] in percent [65. 86. 86. 92. 64.]
Store #37 with 7206 instance has missing values [4750, 6266, 5112, 6944, 4542] in percent [66. 87. 71. 96. 63.]
Store #38 with 7362 instance has missing values [4658, 6414, 5241, 6235, 4658] in percent [63. 87. 71. 85. 63.]
Store #39 with 9878 instance has missing values [6356, 7046, 6492, 6356, 6356] in percent [64. 71. 66. 64. 64.]
Store #40 with 10017 instance has missing values [6418, 7268, 6699, 6418, 6418] in percent [64. 73. 67. 64. 64.]
Store #41 with 10088 instance has missing values [6439, 7083, 6509, 6439, 6439] in percent [64. 70. 65. 64. 64.]
Store #42 with 6953 instance has missing values [4507, 6044, 4862, 6443, 4352] in percent [65. 87. 70. 93. 63.]
Store #43 with 6751 instance has missing values [4341, 6045, 4715, 6464, 4341] in percent [64. 90. 70. 96. 64.]
Store #44 with 7169 instance has missing values [4649, 6414, 5213, 6600, 4548] in percent [65. 89. 73. 92. 63.]
Store #45 with 9637 instance has missing values [6184, 6791, 6318, 6184, 6184] in percent [64. 70. 66. 64. 64.]
In [3]:
cntLst = []
print("Table 2")
print('''Date range WITHOUT Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"''', end="\n")
print()
for store in train.Store.unique():
    arr = sorted(pd.to_datetime(train.loc[train.MarkDown1.notnull() & (train.Store == store), "Date"].unique()))
    print("Store %d %s - %s"%(store,min(arr),max(arr)))
Table 2
Date range WITHOUT Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"

Store 1 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 2 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 3 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 4 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 5 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 6 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 7 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 8 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 9 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 10 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 11 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 12 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 13 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 14 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 15 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 16 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 17 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 18 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 19 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 20 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 21 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 22 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 23 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 24 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 25 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 26 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 27 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 28 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 29 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 30 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 31 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 32 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 33 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 34 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 35 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 36 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 37 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 38 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 39 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 40 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 41 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 42 2011-11-11 00:00:00 - 2012-10-19 00:00:00
Store 43 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 44 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 45 2011-11-11 00:00:00 - 2012-10-26 00:00:00
In [4]:
cntLst = []
print("Table 3")
print('''Date range of Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"''', end="\n")
print()
for store in train.Store.unique():
    arr = sorted(pd.to_datetime(train.loc[train.MarkDown1.isnull() & (train.Store == store), "Date"].unique()))
    print("Store %d %s - %s"%(store,min(arr),max(arr)))
Table 3
Date range of Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"

Store 1 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 2 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 3 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 4 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 5 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 6 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 7 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 8 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 9 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 10 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 11 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 12 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 13 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 14 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 15 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 16 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 17 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 18 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 19 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 20 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 21 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 22 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 23 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 24 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 25 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 26 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 27 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 28 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 29 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 30 2010-02-05 00:00:00 - 2012-10-05 00:00:00
Store 31 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 32 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 33 2010-02-05 00:00:00 - 2012-06-29 00:00:00
Store 34 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 35 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 36 2010-02-05 00:00:00 - 2011-12-09 00:00:00
Store 37 2010-02-05 00:00:00 - 2012-08-31 00:00:00
Store 38 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 39 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 40 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 41 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 42 2010-02-05 00:00:00 - 2012-10-26 00:00:00
Store 43 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 44 2010-02-05 00:00:00 - 2011-12-09 00:00:00
Store 45 2010-02-05 00:00:00 - 2011-11-04 00:00:00
In [5]:
print("Table 4")
print('''Split the feature "Store_Type" into A,B,C and "IsHoliday" into False,True ''')
#display(pd.get_dummies(train[["IsHoliday"]]))
pd.concat((pd.get_dummies(train["Store_Type"]),pd.get_dummies(train["IsHoliday"])), axis=1).head(5)
Table 4
Split the feature "Store_Type" into A,B,C and "IsHoliday" into False,True 
Out[5]:
A B C False True
0 1 0 0 1 0
1 1 0 0 1 0
2 1 0 0 1 0
3 1 0 0 1 0
4 1 0 0 1 0
In [6]:
import plotly.graph_objs as go
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.offline as offline

init_notebook_mode(connected=True)

namesLst=list(map(str,train.Store.unique()))

aa=[]
salesLst=[]
for name in namesLst:
    aa=round(train.Weekly_Sales[train.Store == int(name)]/1000,2)
    salesLst = salesLst + [aa]

traces=[]

for name, sales in zip (namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))

layout = go.Layout(
    title = 'Range of Sales Values by each store (Figure 1)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=False,
    width=700,
    height=1000,
    yaxis =dict(title = "Store Number", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="Weekly Sales (in thousands)",
                 titlefont=dict(size=18),
                 zeroline=True, range=[-10,200], showgrid=True),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False
)
    
fig = go.Figure(data=traces, layout=layout)
iplot(fig, show_link=True)
In [7]:
namesLst=list(map(str,train.Store.unique()))

aa=[]
salesLst=[]
for name in namesLst:
    aa=round(train.Temperature[train.Store == int(name)],2)
    salesLst = salesLst + [aa]

traces=[]

for name, sales in zip (namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))

layout = go.Layout(
    title = 'Range of Temperature Values by each store (Figure 2)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=False,
    width=700,
    height=1000,
    yaxis =dict(title = "Store Number", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="Temperature(F)",
                 titlefont=dict(size=18),
                 zeroline=True, range=[-5,120], showgrid=True),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False
)
    
fig = go.Figure(data=traces, layout=layout)
iplot(fig)
In [8]:
namesLst=list(map(str,train.Store.unique()))

aa=[]
salesLst=[]
for name in namesLst:
    aa=round(train.Fuel_Price[train.Store == int(name)],2)
    salesLst = salesLst + [aa]

traces=[]

for name, sales in zip (namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))

layout = go.Layout(
    title = 'Range of Fuel Price Values by each store (Figure 3)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=False,
    width=700,
    height=1000,
    yaxis =dict(title = "Store Number", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,
                ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="Fuel Price($)",
                 titlefont=dict(size=18),
                 zeroline=True, 
                 range=[2,5], 
                 showgrid=True),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False
)
    
fig = go.Figure(data=traces, layout=layout)
iplot(fig)
In [9]:
namesLst=list(map(str,train.Store.unique()))

aa=[]
salesLst=[]
for name in namesLst:
    aa=round(train.CPI[train.Store == int(name)],2)
    salesLst = salesLst + [aa]

traces=[]

for name, sales in zip (namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))

layout = go.Layout(
    title = 'Range of CPI Values by each store (Figure 4)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=False,
    width=700,
    height=1000,
    yaxis =dict(title = "Store Number", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="CPI",
                 titlefont=dict(size=18),
                 zeroline=True, range=[100,250], showgrid=True),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False
)
    
fig = go.Figure(data=traces, layout=layout)
iplot(fig)
In [15]:
namesLst=list(map(str,train.Store.unique()))

aa=[]
salesLst=[]
for name in namesLst:
    aa=round(train.Unemployment[train.Store == int(name)],2)
    salesLst = salesLst + [aa]

traces=[]

for name, sales in zip (namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))

layout = go.Layout(
    title = 'Range of Unemployment Rate by each store (Figure 5)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=False,
    width=700,
    height=1000,
    yaxis =dict(title = "Store Number", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="Unemployment Rate",
                 titlefont=dict(size=18),
                 zeroline=True, range=[0,20], showgrid=True),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False
)
    
fig = go.Figure(data=traces, layout=layout)
iplot(fig)

Data Analysis


Weekly Sales Summary

Let's first take a close look at the trend of weekly sales over the covered period. The sales plot captures the trend from Feb-2010 to Oct-2012, with per-store sales ranging from approximately 0 to 4 million dollars. The sales trend for each store is flat over time, except during the two holiday seasons, which run approximately from Nov 20 to Dec 31 in both years. This period covers two big holidays, Thanksgiving and Christmas, when retailers traditionally offer deep discounts. During these high-selling weeks, sales jump to 1.5-2 times their usual level, so we ought to pay extra attention to predictions for the holiday period.

In [13]:
import plotly.graph_objs as go
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.offline as offline

init_notebook_mode(connected=True)
In [11]:
#Reload Data

import pandas as pd
import numpy as np

trainDf = pd.read_csv("./data/train.csv")
feaDf = pd.read_csv("./data/features.csv").drop(["IsHoliday"], axis=1)
storesDf = pd.read_csv("./data/stores.csv")


df1 = pd.merge(trainDf, feaDf, on = ["Store", "Date"])
#print(df1.shape)
train = pd.merge(df1, storesDf, on = "Store")
print("Dataset size:",train.shape)

#Rename column names
train.columns = ['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday', 'Temperature',
       'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4',
       'MarkDown5', 'CPI', 'Unemployment', 'Store_Type', 'Size']
Dataset size: (421570, 16)
In [12]:
# Dropping Features
train.drop(columns=["MarkDown1","MarkDown2","MarkDown3","MarkDown4", "MarkDown5"], inplace = True)
In [14]:
#Gather Weekly Sales for each store
namesLst=list(map(str,train.Store.unique()))
dateLst = train.Date.unique()

aa=[]
salesLst=[]
timeLst=[]
storeSalesDict={}
for name in namesLst:
    #print(name)
    storeSalesDict[int(name)]=[]
    for date in dateLst:
        aa=round(train.Weekly_Sales[train.Store == int(name)][train.Date == date].sum()/ 1000000,2)
        storeSalesDict[int(name)].append(aa)
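The nested store-by-date filtering loop above can be collapsed into a single grouped aggregation, which is both faster and less error-prone. A minimal sketch on a hypothetical two-store, two-week frame:

```python
import pandas as pd

# Hypothetical mini version of the merged frame; the real one spans
# 45 stores and roughly 140 weekly dates.
train = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Date": ["2010-02-05", "2010-02-12", "2010-02-05", "2010-02-12"],
    "Weekly_Sales": [1_000_000.0, 2_000_000.0, 500_000.0, 1_500_000.0],
})

# One groupby replaces the per-store, per-date filtering loop:
# total sales per store per week, in millions
weekly = (train.groupby(["Store", "Date"])["Weekly_Sales"]
               .sum()
               .div(1_000_000)
               .round(2)
               .unstack("Store"))
print(weekly)
```

The resulting frame has one row per week and one column per store, which is exactly the shape the Figure 6 traces need.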
In [15]:
traces=[]

for i, sales in storeSalesDict.items():
    #print(i, sales)
    traces.append(go.Scatter(
    x= pd.to_datetime(dateLst),
    y=sales,
    name="Store#"+str(i)
    ))
In [16]:
import numpy as np
aa= np.zeros((len(storeSalesDict),len(storeSalesDict)))
aa = [[str(aa[i][j]) for i in range(0,aa.shape[0])] for j in range(0,aa.shape[1])]
aa = [[False for i in range(0,len(aa))] for j in range(0,len(aa))]
for i in range(len(aa)):
    aa[i][i] = True
In [17]:
updatemenus = list([
    dict(type="buttons",
         active=-1,
         buttons=list([
            dict(label = 'Store 1',
                 method = 'update',
                 args = [{'visible': aa[0]},
                         {'title': 'Store 1 Sales'}])]))])
In [20]:
layout = go.Layout(
    title = 'Weekly Sales (Figure 6)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=True,
    #width=700,
    #height=1000,
    yaxis =dict(title = "Weekly Sales (in Millions)", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="Time Series (in Months)",
                 titlefont=dict(size=18),
                 exponentformat='e',
                 showexponent='all',
                 zeroline=True,  
                 showgrid=False,
                 rangeselector=dict(visible=True, 
                                    buttons=list([
                                        dict(count=1, 
                                             label="1m", 
                                             step="month", 
                                             stepmode ="backward"),
                                        dict(count=3, 
                                                 label="3m", 
                                                 step="month", 
                                                 stepmode ="backward"),
                                         dict(count=6, 
                                                 label="6m", 
                                                 step="month", 
                                                 stepmode ="backward"),
                                         dict(count=12, 
                                                 label="12m", 
                                                 step="month", 
                                                 stepmode ="backward"),
                                        dict(step ="all")])),
                 rangeslider = dict(visible=True)
                ),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False,
    updatemenus=updatemenus
)
In [21]:
fig = go.Figure(data=traces, layout=layout)
iplot(fig, show_link=True)

Weekly Sales in Store_Type

Figure 7 below shows the total sales generated by each store type (i.e. A, B, C). Unsurprisingly, sales on regular working days make up the major proportion of total sales, because holidays like Christmas and Thanksgiving cover only a small portion of the year. However, in Figure 8 we can observe that the average weekly sales during holidays are higher than during working weeks for Type A and B stores, while sales are essentially unchanged for Type C stores. To determine the difference in sales performance between holidays and non-holidays more precisely, I propose setting up a hypothesis test for further verification.
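The proposed hypothesis test could be a two-sample (Welch's) t-test of holiday vs. working-week mean sales. A minimal sketch with simulated, hypothetical weekly totals rather than the real data; the means, spreads, and sample sizes here are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical weekly totals (in $M): holiday weeks drawn with a higher mean
holiday = rng.normal(loc=22.0, scale=2.0, size=30)
working = rng.normal(loc=20.0, scale=2.0, size=120)

# Welch's two-sample t statistic: H0 says both groups share the same mean
mean_diff = holiday.mean() - working.mean()
se = np.sqrt(holiday.var(ddof=1) / len(holiday)
             + working.var(ddof=1) / len(working))
t_stat = mean_diff / se
# |t| beyond ~2 rejects H0 at roughly the 5% level for these sample sizes
print(round(t_stat, 2), abs(t_stat) > 2)
```

In practice `scipy.stats.ttest_ind(holiday, working, equal_var=False)` computes the same statistic along with a p-value; on the real data the test would be run separately for each store type.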

In [23]:
trace1 = []

salesA = train.Weekly_Sales[train.Store_Type=="A"][train.IsHoliday==False].sum()
salesB = train.Weekly_Sales[train.Store_Type=="B"][train.IsHoliday==False].sum()
salesC = train.Weekly_Sales[train.Store_Type=="C"][train.IsHoliday==False].sum()

trace1 = go.Bar(
    x = sorted(list(set(train.Store_Type))),
    y = [salesA, salesB, salesC],
    name = "Working day"
)

salesA = train.Weekly_Sales[train.Store_Type=="A"][train.IsHoliday==True].sum()
salesB = train.Weekly_Sales[train.Store_Type=="B"][train.IsHoliday==True].sum()
salesC = train.Weekly_Sales[train.Store_Type=="C"][train.IsHoliday==True].sum()

trace2=[]
trace2 = go.Bar(
    x = sorted(list(set(train.Store_Type))),
    y = [salesA, salesB, salesC],
    name = "Holiday"
)

data = [trace1, trace2]
layout = go.Layout(
    title = "Store Type Sale (%s to %s) - Total (Figure 7) "% (min(train.Date), max(train.Date)),
    barmode='group',
    showlegend=True,
    xaxis = dict(title = "Store Type"),
    yaxis = dict(title = " Total Sales")
)

fig = go.Figure(data=data, layout=layout)
iplot(fig, show_link=True)
In [25]:
# Average weekly sales per store type on non-holiday weeks
salesA_len = train.Date[(train.Store_Type == "A") & (train.IsHoliday == False)].nunique()
salesA_tot = train.Weekly_Sales[(train.Store_Type == "A") & (train.IsHoliday == False)].sum()
salesA_avg = salesA_tot / salesA_len

salesB_len = train.Date[(train.Store_Type == "B") & (train.IsHoliday == False)].nunique()
salesB_tot = train.Weekly_Sales[(train.Store_Type == "B") & (train.IsHoliday == False)].sum()
salesB_avg = salesB_tot / salesB_len

salesC_len = train.Date[(train.Store_Type == "C") & (train.IsHoliday == False)].nunique()
salesC_tot = train.Weekly_Sales[(train.Store_Type == "C") & (train.IsHoliday == False)].sum()
salesC_avg = salesC_tot / salesC_len

trace1 = go.Bar(
    x = sorted(list(set(train.Store_Type))),
    y = [salesA_avg, salesB_avg, salesC_avg],
    name = "Working day - " + str(salesA_len) + " days"
)

# Average weekly sales per store type on holiday weeks
salesA_len = train.Date[(train.Store_Type == "A") & (train.IsHoliday == True)].nunique()
salesA_tot = train.Weekly_Sales[(train.Store_Type == "A") & (train.IsHoliday == True)].sum()
salesA_avg = salesA_tot / salesA_len

salesB_len = train.Date[(train.Store_Type == "B") & (train.IsHoliday == True)].nunique()
salesB_tot = train.Weekly_Sales[(train.Store_Type == "B") & (train.IsHoliday == True)].sum()
salesB_avg = salesB_tot / salesB_len

salesC_len = train.Date[(train.Store_Type == "C") & (train.IsHoliday == True)].nunique()
salesC_tot = train.Weekly_Sales[(train.Store_Type == "C") & (train.IsHoliday == True)].sum()
salesC_avg = salesC_tot / salesC_len

trace2 = go.Bar(
    x = sorted(list(set(train.Store_Type))),
    y = [salesA_avg, salesB_avg, salesC_avg],
    name = "Holiday - "+str(salesA_len) + " days"
)

data = [trace1, trace2]
layout = go.Layout(
    title = "Store Type Sale (%s to %s) - Weekly (Figure 8) "% (min(train.Date), max(train.Date)),
    barmode='group',
    showlegend=True,
    xaxis = dict(title = "Store Type"),
    yaxis = dict(title = " Weekly Sales ")
)

fig = go.Figure(data=data, layout=layout)
iplot(fig, show_link=True)
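As an aside, the per-type totals and averages computed above with repeated boolean filters can be obtained in a single `groupby`. The sketch below uses a toy frame in place of the real `train` data; only the column names (`Store_Type`, `IsHoliday`, `Weekly_Sales`) are assumed from the dataset.

```python
import pandas as pd

# Toy stand-in for the real `train` frame (column names are the only assumption)
train = pd.DataFrame({
    "Store_Type": ["A", "A", "B", "B", "C", "C"],
    "IsHoliday": [False, True, False, True, False, True],
    "Weekly_Sales": [100.0, 150.0, 80.0, 90.0, 40.0, 41.0],
})

# One groupby replaces the six per-type filters: rows are store types,
# columns are the holiday flag, values are total sales.
totals = train.groupby(["Store_Type", "IsHoliday"])["Weekly_Sales"].sum().unstack()
print(totals)
```

The same pattern with `.mean()` instead of `.sum()` yields the per-week averages used in Figure 8.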

Weekly Sales by Department

Figure 9 below ranks the departments by average weekly sales. The highest average weekly sales, about $75,000, is generated by Dept 92; the lowest, -$7.68, by Dept 47. The plot highlights the sales disparity between departments.

In [27]:
# Average weekly sales per department, sorted ascending
df = (train.groupby("Dept")["Weekly_Sales"].mean().round(2)
           .sort_values().reset_index())
dept_lst = list(df.iloc[:, 0])
sales_dept = list(df.iloc[:, 1])

trace3 = go.Bar(
    x = list(df.iloc[:,1]),
    y = list(map(str,list(df.iloc[:,0]))),
    orientation="h"
    #name = "Working day"
)

data = [trace3]
layout = go.Layout(
    #barmode='group',
    width=700,
    height=1200, 
    #showlegend=True
    title = 'Average Weekly Sales Ranking (Figure 9)',
    yaxis =dict(title = "Department Number", 
                #exponentformat='e',
                #showexponent='all',
                titlefont=dict(size=18),
                tick0=5,
                ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=False,
                type='category'),
    xaxis = dict(title="Weekly Sales($)",
                 titlefont=dict(size=18),
                 zeroline=True, 
                 #range=[2,5], 
                 showgrid=True)
)

annotations=[]
for i in range(len(dept_lst)):
    annotations.append(dict(x=sales_dept[i],
                            y=dept_lst[i],
                            text="%0.2f" % sales_dept[i],
                            font=dict(family='Arial',
                                      size=12,
                                      color='red'),
                            showarrow=True,
                            align="center",
                            ax=40,
                            ay=0,
                            arrowhead=0))
# Assign once, after the loop has built all annotations
layout['annotations'] = annotations


fig = go.Figure(data=data, layout=layout)
iplot(fig, show_link=True)

Correlation with the Target Variable

The following scatter plots roughly show the correlations between the predictor variables and the target variable.

  • Weekly_Sales vs Temperature
  • Weekly_Sales vs Fuel_Price
  • Weekly_Sales vs CPI
  • Weekly_Sales vs Size

Unfortunately, it is difficult to judge the strength of the correlation between these features and Weekly Sales, with the exception of Store Size (larger stores appear to generate higher weekly sales). Hypothesis tests will be performed afterward for further verification.
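A quick numeric companion to these scatter plots is `scipy.stats.pearsonr`, which returns the coefficient together with a two-sided p-value. The sketch below uses synthetic arrays standing in for the `Size` and `Weekly_Sales` columns; the linear relationship is illustrative, not the real data.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for df["Size"] and df["Weekly_Sales"]:
# size-driven sales plus noise (values are illustrative only)
rng = np.random.default_rng(0)
size = rng.uniform(30_000, 220_000, 200)
weekly_sales = 5.0 * size + rng.normal(0.0, 1e5, 200)

# pearsonr returns (correlation coefficient, two-sided p-value) in one call
r, p = stats.pearsonr(size, weekly_sales)
print(f"r = {r:.3f}, p = {p:.2e}")
```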

In [ ]:
'''*************************No need to Run*****************'''
wsLst=[]
tempLst=[]
fpLst=[]
cpiLst=[]
sizeLst=[]
storeLst=[]
dateLst=[]

for date in list(set(train.Date)):
    print(date)
    for store in list(set(train.Store)):
        mask = (train.Store == store) & (train.Date == date)
        storeLst.append(store)
        dateLst.append(date)
        wsLst.append(train.Weekly_Sales[mask].sum())
        # These features are constant within a (store, date) group,
        # so take the first value
        tempLst.append(train.Temperature[mask].iloc[0])
        fpLst.append(train.Fuel_Price[mask].iloc[0])
        cpiLst.append(train.CPI[mask].iloc[0])
        sizeLst.append(train.Size[mask].iloc[0])
        
pd.DataFrame(list(zip(storeLst,dateLst,wsLst, tempLst, fpLst, cpiLst, sizeLst)), 
             columns= ["store","date","Weekly_Sales", "Temperture", "Fuel_Price", "CPI", "Size"]).to_csv("Weekly_Sales_Corr2.csv")        
In [29]:
'''****************Read the Weekly Sales Correlated Feature Table************'''
df = pd.read_csv("Weekly_Sales_Corr2.csv")
ws=list(df.Weekly_Sales)
temp=list(df.Temperture)
fp=list(df.Fuel_Price)
cpi=list(df.CPI)
size=list(df.Size)
#df.head(5,)
In [31]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from numpy import arange,array,ones
from scipy import stats
import seaborn as sns


plt.figure(figsize=(10,10))
sns.regplot(x="Fuel_Price", y="Weekly_Sales", data=df)
plt.title("Fuel Price vs Weekly_Sales", fontsize=20)
plt.show()
/Users/KevQuant/anaconda/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning:

Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.

In [34]:
plt.figure(figsize=(10,10))
sns.regplot(x="Temperture", y="Weekly_Sales", data=df)
plt.title("Temperature vs Weekly_Sales", fontsize=20)
plt.show()

In [32]:
plt.figure(figsize=(10,10))
sns.regplot(x="CPI", y="Weekly_Sales", data=df)
plt.title("CPI vs Weekly_Sales", fontsize=20)
plt.show()

In [33]:
plt.figure(figsize=(10,10))
sns.regplot(x="Size", y="Weekly_Sales", data=df)
plt.title("Store Size vs Weekly_Sales", fontsize=20)
plt.show()

Heatmap

Dependent vs Independent Variables

The correlation plots above suggest that store size (independent variable) correlates most strongly with store weekly sales (dependent variable). To verify this, the following heatmap shows the correlation values numerically. The values confirm that store size is the only feature with a correlation above 0.5; the remaining features show very little correlation with Weekly_Sales. In this case, I would suggest comparing predictions from models built with and without the weakly correlated features. Hypothesis tests will follow to verify these results.

Independent vs Independent Variables

Among the independent variables themselves, no pair shows high correlation, so I would not remove any feature at this point. As above, hypothesis tests will follow to verify this.

In [39]:
import matplotlib.pyplot as plt
import seaborn as sns

df2 = df.copy()
df2.drop(columns=["Unnamed: 0","store","date"], inplace=True)

df2_corr =df2.corr()
sns.heatmap(df2_corr, linewidths=0.5, annot=True)
plt.xticks(rotation=45)
plt.title("Correlation Heatmap", fontsize=15)
plt.show()
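The |r| > 0.5 cutoff discussed above can also be applied programmatically. The sketch below uses a toy frame shaped like `df2` (values are illustrative, and the cutoff itself is a judgement call):

```python
import pandas as pd

# Toy frame shaped like df2 above; the values are illustrative only
df2 = pd.DataFrame({
    "Weekly_Sales": [1.0, 2.0, 3.0, 4.0],
    "Size": [10, 22, 28, 41],
    "Fuel_Price": [2.5, 2.8, 2.4, 2.6],
})

corr = df2.corr()
# Keep only the features whose |r| with the target clears the 0.5 cutoff
strong = corr["Weekly_Sales"].drop("Weekly_Sales").abs() > 0.5
print(list(strong[strong].index))
```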

Hypotheses Test: (Dependent vs Independent Variables)


  • Null Hypothesis H0: the features (i.e. Temperature, Fuel_Price, CPI, Store Size) are NOT correlated with store weekly sales.
  • Alt. Hypothesis Ha: the features are correlated with store weekly sales.

According to the following tests, Temperature, CPI and Store Size each yield |Z| > 3 (Z = -5.1, -5.8 and 65.6 respectively), meaning the observed correlation lies more than three standard deviations from the mean of the permutation (null) distribution, i.e. outside the central 99.7% of that distribution. We therefore have strong evidence to reject the null hypothesis: every feature except Fuel_Price is correlated with store weekly sales. This aligns with the heatmap above (e.g. the correlation of Fuel_Price vs Weekly_Sales is 0.0095). As mentioned, I would not remove any feature for the initial machine-learning stage even where its correlation with Weekly_Sales (the dependent variable) is low; instead, we can compare the performance of models built with and without the weakly correlated features.

In [40]:
# Pearson correlation coefficient of two arrays
def pearson_r(x, y):
    corr_mat = np.corrcoef(x, y)
    return corr_mat[0, 1]

# Z-score of `value` within the distribution `array`
# (note: despite the name, this returns a z-score, not a tail probability)
def p_value(array, value):
    return (value - np.mean(array)) / np.std(array)
In [41]:
xi = temp
yi = ws

No_bootstrap_trial = len(ws)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Temperature vs Weekly Sales is", p_val[0,1])
The Z value for Temperature vs Weekly Sales is -5.12599346608834
In [42]:
xi = fp
yi = ws

No_bootstrap_trial = len(ws)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Fuel_Price vs Weekly Sales is", p_val[0,1])
The Z value for Fuel_Price vs Weekly Sales is 0.7487650321211525
In [43]:
xi = cpi
yi = ws

No_bootstrap_trial = len(ws)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for CPI vs Weekly Sales is", p_val[0,1])
The Z value for CPI vs Weekly Sales is -5.831720613702725
In [44]:
xi = size
yi = ws

No_bootstrap_trial = len(ws)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Store Size vs Weekly Sales is", p_val[0,1])
The Z value for Store Size vs Weekly Sales is 65.6174496224247
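The four test cells above repeat the same permutation procedure; it can be factored into one helper. This is a sketch, not the original code: it permutes only one array, since shuffling one side alone already destroys any pairing (the cells above permute both, which is redundant but harmless), and the number of trials is a free parameter rather than the sample size.

```python
import numpy as np

def permutation_z(x, y, n_trials=1000, seed=0):
    """Z-score of the observed Pearson r against a shuffled-pairing null."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    observed = np.corrcoef(x, y)[0, 1]
    # Null distribution: correlations after breaking the (x, y) pairing
    null = np.array([np.corrcoef(x, rng.permutation(y))[0, 1]
                     for _ in range(n_trials)])
    return (observed - null.mean()) / null.std()

# Illustrative check: a strongly paired signal lands far outside |Z| = 3
x = np.arange(200, dtype=float)
y = x + np.random.default_rng(1).normal(0.0, 10.0, 200)
print(permutation_z(x, y))
```

With this helper, each test below reduces to one call such as `permutation_z(temp, ws)`.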

Hypotheses Test: (Store Size vs Other Independent Variables)


  • Null Hypothesis H0: Store Size is NOT correlated with the other features (i.e. Temperature, Fuel_Price, CPI).
  • Alt. Hypothesis Ha: Store Size is correlated with the other features.

According to the following tests, only Store Size vs Temperature yields |Z| > 3 (Z = -7.5), so we have strong evidence to reject the null hypothesis for that pair: Store Size is correlated with Temperature. For CPI (Z = -0.78) and Fuel Price (Z = 0.73) we fail to reject the null hypothesis, so Store Size does not appear to be correlated with those two features. This aligns with the correlation values in the heatmap above.

In [45]:
xi = size
yi = cpi

No_bootstrap_trial = len(size)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Store Size vs CPI is", p_val[0,1])
The Z value for Store Size vs CPI is -0.7824637189185721
In [46]:
xi = size
yi = fp

No_bootstrap_trial = len(size)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Store Size vs Fuel Price is", p_val[0,1])
The Z value for Store Size vs Fuel Price is 0.726107704446794
In [47]:
xi = size
yi = temp

No_bootstrap_trial = len(size)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Store Size vs Temperature is", p_val[0,1])
The Z value for Store Size vs Temperature is -7.463314421009281

Hypotheses Test: (Temperature vs Other Independent Variables)


  • Null Hypothesis H0: Temperature is NOT correlated with the other features (i.e. Fuel_Price, CPI).
  • Alt. Hypothesis Ha: Temperature is correlated with the other features.

According to the following tests, Temperature vs Fuel Price (Z = 11.7) and Temperature vs CPI (Z = 14.3) both yield |Z| > 3, so we have strong evidence to reject the null hypothesis: Temperature is correlated with both Fuel Price and CPI. This aligns with the correlation values in the heatmap above.

In [48]:
xi = fp
yi = temp

No_bootstrap_trial = len(temp)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Temperature vs Fuel Price is", p_val[0,1])
The Z value for Temperature vs Fuel Price is 11.676566479392918
In [49]:
xi = cpi
yi = temp

No_bootstrap_trial = len(temp)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Temperature vs CPI is", p_val[0,1])
The Z value for Temperature vs CPI is 14.266627971110935

Hypotheses Test: (Fuel Price vs CPI)


  • Null Hypothesis H0: Fuel Price is NOT correlated with CPI.
  • Alt. Hypothesis Ha: Fuel Price is correlated with CPI.

According to the following test, Fuel Price vs CPI yields Z = -13.6, i.e. |Z| > 3, so we have strong evidence to reject the null hypothesis: Fuel Price is correlated with CPI. This aligns with the correlation values in the heatmap above.

In [50]:
xi = cpi
yi = fp

No_bootstrap_trial = len(fp)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Fuel Price vs CPI is", p_val[0,1])
The Z value for Fuel Price vs CPI is -13.646755720403766

Hypothesis Summary


The hypothesis tests above give us insight into how the independent features interact with one another and how the dependent and independent features are correlated. The feature most strongly correlated with Weekly Sales is Store Size; the remaining independent features are comparatively weakly correlated. These insights will inform the machine-learning stage: we can iterate over models with different feature combinations in search of the most efficient, best-performing model.

Findings from Data Analysis


  • NOT all the predictor variables are informative: MarkDown1 - MarkDown5 contain large numbers of missing values, so they are dropped from the data analysis and from further predictive modeling.

  • Most instances are retained even when they contain outlier values, because those points may reflect seasonal patterns in the target variable (Weekly Sales), e.g. store sales are consistently higher during holiday seasons.

  • Correlations: most features, except Fuel_Price, are correlated with the target variable (store weekly sales).

  • Correlations: the predictor variables are mostly correlated with each other; the exceptions are Store Size vs Fuel Price and Store Size vs CPI.

Modeling and Forecast


  • Use the features analyzed above to build predictive models forecasting Weekly Sales (continuous) for a single store or department.
  • Grid-search for the best model based on various evaluation metrics (e.g. MSE).
  • Proposed modeling methodologies:
    • Single regressors: Linear Regression, k-nearest neighbors, decision tree
    • Ensemble regressors: Random Forest Regressor, XGBoost Regressor
    • Deep learning: LSTM, 1-D convolutional network (Conv1D), GRU
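The grid-search step can be sketched with scikit-learn. The example below uses a synthetic feature matrix in place of the engineered features (in the real pipeline the columns would be Size, CPI, Temperature, etc.) and tunes one of the proposed single regressors on MSE:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the engineered feature matrix: sales driven
# mainly by one feature plus noise (values are illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = 3.0 * X[:, 0] + rng.normal(0.0, 0.1, 400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Grid-search a k-NN regressor on MSE, as proposed above
search = GridSearchCV(KNeighborsRegressor(),
                      {"n_neighbors": [3, 5, 9]},
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X_tr, y_tr)
mse = mean_squared_error(y_te, search.predict(X_te))
print(search.best_params_, round(mse, 3))
```

For the actual weekly-sales forecast, a chronological split (e.g. `TimeSeriesSplit`) would be more appropriate than the random split sketched here, since a time-series model must not train on future weeks.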